A data-based classification of Slavic languages: Indices of qualitative variation applied to grapheme frequencies

Authors

  • Michaela Koščová
  • Ján Mačutek
  • Emmerich Kelih
Abstract

Ord’s graph is a simple graphical method for displaying frequency distributions of data or theoretical distributions in the two-dimensional plane. Its coordinates are ratios of the first three moments, either empirical or theoretical. A modification of Ord’s graph based on ratios of indices of qualitative variation is presented. This modification makes the graph applicable to categorical data as well. In addition, the indices are normalized to values between 0 and 1, which makes it possible to compare data sets divided into different numbers of categories. Both the original and the new graph are used to display grapheme frequencies in eleven Slavic languages. As the original Ord’s graph requires numbers to be assigned to the categories, graphemes were ranked in decreasing order of frequency. Data were taken from parallel corpora, i.e., we work with grapheme frequencies from a Russian novel and its translations into ten other Slavic languages. Cluster analysis is then applied to the graph coordinates. While the original graph yields results which are not linguistically interpretable, the modification reveals meaningful relations among the languages.

M. Koščová and J. Mačutek were supported by VEGA grant 2/0047/15.

M. Koščová
Department of Applied Mathematics and Statistics, Comenius University, Mlynská dolina, SK-84248 Bratislava, Slovakia

J. Mačutek
Department of Applied Mathematics and Statistics, Comenius University, Mlynská dolina, SK-84248 Bratislava, Slovakia
Tel.: +421-2-60295717
Fax: +421-2-65412305

E. Kelih
Department of Slavonic Studies, University of Vienna, Spitalgasse 2, Hof 3, AT-1090 Wien, Austria
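As an illustration of the coordinates mentioned in the abstract, the following minimal Python sketch computes the classical Ord coordinates, I = m2/m1' and S = m3/m2, for graphemes ranked in decreasing order of frequency. It also computes one common normalized index of qualitative variation. The abstract does not say which indices the authors use, so `normalized_iqv` (based on the Gini-Simpson index) is an assumption chosen only to show normalization to the interval from 0 to 1. The function names and the sample text are hypothetical.

```python
from collections import Counter


def ord_coordinates(frequencies):
    """Return Ord's (I, S) for categories ranked 1, 2, ... by decreasing frequency.

    I = m2 / m1' and S = m3 / m2, where m1' is the mean rank and
    m2, m3 are the second and third central moments of the rank.
    Requires at least two categories with nonzero frequency (m2 > 0).
    """
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    probs = [f / total for f in ranked]
    ranks = range(1, len(probs) + 1)
    mean = sum(r * p for r, p in zip(ranks, probs))
    m2 = sum((r - mean) ** 2 * p for r, p in zip(ranks, probs))
    m3 = sum((r - mean) ** 3 * p for r, p in zip(ranks, probs))
    return m2 / mean, m3 / m2


def normalized_iqv(frequencies):
    """Gini-Simpson index rescaled to [0, 1] (assumed example, not the paper's index)."""
    total = sum(frequencies)
    k = len(frequencies)
    simpson = sum((f / total) ** 2 for f in frequencies)
    return (1 - simpson) * k / (k - 1)


# Hypothetical usage: grapheme counts from a short sample string.
sample = "abracadabra"
counts = Counter(ch for ch in sample.lower() if ch.isalpha())
print(ord_coordinates(counts.values()))
print(normalized_iqv(list(counts.values())))
```

The normalized index does not depend on any ordering of the categories, whereas Ord's coordinates require ranks to be assigned first. This is the reason the abstract gives for ordering graphemes by frequency.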

Similar resources

Towards a General Model of Grapheme Frequencies for Slavic Languages

The present study discusses a possible theoretical model for grapheme frequencies of Slavic alphabets. Based on previous research on Slovene, Russian, and Slovak grapheme frequencies, the negative hypergeometric distribution is presented as a model, adequate for various Slavic languages. Additionally, arguments are provided in favor of the assumption that the parameters of this model can be int...

Differences of pitch profiles in Germanic and Slavic languages

This study investigates cross-language differences in pitch range and variation in four languages from two language groups: English and German (Germanic) and Bulgarian and Polish (Slavic). The analysis is based on large multi-speaker corpora (48 speakers for Polish, 60 for each of the other three languages). Linear mixed models were computed that include various distributional measures of pitch...

A Jakobsonian Feature Based Analysis of the Slavic Numeric Quantifier Genitive*

This paper subjects the GB parametric account of variation in Slavic numeral systems put forward in Franks (1995) to critical scrutiny from the perspective of minimalism. It is argued that the true nature of the variation lies in the case contexts in which QPs (phrases in which GEN-Q is assigned) can occur in the different languages. It is further argued that this variation is best understood i...

Speech recognition for east Slavic languages: the case of Russian

In this paper, we present a survey of state-of-the-art systems for automatic processing and recognition of under-resourced languages of Eastern Europe, in particular, East Slavic languages (Ukrainian, Belarusian and Russian), which share some common prominent features including the Cyrillic alphabet, phonetic classes, morphological structure of wordforms and relatively free grammar. A large voca...

Vegetation change detection using multi-temporal remotely sensed data during recent three decades by artificial intelligence technique (Case study: protected area of Bashgol)

Quantitative and qualitative information on vegetation and its changes over time is very important: it is a basic foundation for determining habitat quality, prioritizing protected areas, and pricing ecosystem services for optimal management of natural resources and sustainable development. On the other hand, researchers are interested in rem...

Journal:
  • Journal of Quantitative Linguistics

Volume 23

Publication date: 2016